A stroke is a serious, life-threatening medical condition that happens when the blood supply to part of the brain is cut off or when a blood vessel in the brain bursts. In either case, brain cells can be damaged or die.
The impact of stroke can be short- and long-term, depending on which part of the brain is affected and how quickly it is treated. A stroke can cause lasting brain damage, wide-ranging disabilities, or even death.
According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths. Over 110 million people in the world have experienced a stroke. [1]
Stroke is also the second leading cause of death in Italy: almost 200,000 cases are recorded every year, and this number is constantly increasing due to the ageing of the population. [2]
Nowadays prevention is an important public health concern, so it is crucial to better understand the causes of stroke and to improve our capacity to predict whether a person will have one in their lifetime.
This dataset, which I have taken from Kaggle, is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row provides relevant information about one patient. The dataset contains 5110 patients, each described by 10 features. Moreover, each patient has a label which indicates whether they had a stroke or not. [3]
Some missing values (201) occur, only in the BMI column, and they will be treated appropriately.
The creators of this dataset collected the following information for each patient:
To get a clear idea of the dataset and to understand which patients I am observing, I have made a pie chart for each categorical variable, counting the number of occurrences of each category and, consequently, its percentage. For these visualizations I have followed an online guide which uses the highcharter and dplyr libraries. [12]
Regarding the gender of the patients, the dataset is fairly balanced, even if females are the majority. Just one person has declared their gender as “Other”.
From the second pie chart it is observable that less than 10% of the patients are affected by hypertension (high blood pressure).
Even fewer people suffer from heart disease: just 5.4% are affected by it.
From the pie chart about married people it is possible to notice that almost 2 people out of 3 have been married at least once in their lifetime.
The patients are fairly balanced in terms of residence: roughly half of them live in a city while the other half live in the countryside.
Just 22 patients have never worked, while the majority of them work in the private sector. There are almost 700 children, while the remaining patients have a government job or are self-employed.
The smoking status of more than 1500 patients is unknown, while almost 1900 have never smoked. Just around 800 people actually smoke. Since the smoking status of so many people is unknown, I have decided to treat “Unknown” as a category rather than as missing values, also because some people may smoke electronic cigarettes or smoke only occasionally, so it would be difficult to classify them.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.08 25.00 45.00 43.23 61.00 82.00
I have represented the distribution of the patients' age through a histogram, in which it is observable that the most populated bin is the one between 50 and 55 years, while the least populated one is between 80 and 85 years. In general the majority of the patients are adults, but overall this distribution is fairly uniform.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 55.12 77.25 91.89 106.15 114.09 271.74
The distribution of the average glucose level appears bimodal, the two groups being healthy and diabetic people. The two peaks are at around 80 and 210 mg/dl. These values are reasonable: healthy people have an average glucose level between 70 and 99 mg/dl, while a level over 130 mg/dl is a symptom of diabetes. So I conclude that the majority of the patients are healthy, while just a few could be affected by diabetes. [4]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.30 23.80 28.40 28.89 32.80 97.60
The distribution of the BMI looks Gaussian, with a peak at a BMI between 26 and 27. I am not considering the peak between 28 and 29, because that one is due to the missing values (the mean of BMI is 28.89). These values are reasonable: normal-weight people have a BMI between 18.5 and 25, while a BMI between 25 and 35 indicates that the person is somewhat overweight. [5]
I have then filtered the dataset according to the label, so as to separate healthy people from patients who had a stroke, and I have plotted the distributions of the continuous variables for each of the two groups. I have done this to observe whether, considering just the people who had a stroke, the distributions of these features change.
As expected, the age distribution of just the sick patients is no longer fairly uniform: it is strongly shifted towards higher ages, with a peak at 78 years. This plot confirms the intuitive idea that people who had a stroke are generally older: the majority of them are adults older than 53.
The BMI distribution of the sick people is also slightly shifted towards higher values, with a peak at the mean BMI, circa 28.7. So I think that many of the missing values in the BMI column belong to patients who had a stroke, and that these patients generally weigh a bit more than healthy ones.
The distribution of the average glucose level is still a mixture of two Gaussians, but, as expected, people who had a stroke tend to have a higher glucose level, so the majority of them belong to the second bell, while most of the healthy people have an average glucose level within the norm.
To conclude, I have observed that just 249 patients, i.e. 4.9% of the entire dataset, had a stroke, while the other 4861 are healthy. This is the main drawback of this dataset: it is strongly unbalanced, and it would be better to have a dataset in which the number of patients who had a stroke is comparable with the number of healthy ones.
The imbalance problem can bias classification algorithms towards the majority class and give them poor classification performance on the minority class. Such classifiers are not useful in real-world tasks, because the classification performance on the minority samples is usually of higher importance for decision making in the healthcare area. [6]
To cope with this problem it is necessary to resort to undersampling or oversampling techniques. But since the former leads to information loss (and in this specific case throwing away examples is not a good idea, since the dataset is quite small), and the latter can have a large computational cost, as a first step I am going to tackle the dataset as it is, and only in a second step apply one of these techniques to observe what changes.
To build my model, but also to draw the correlation matrix among the features, I cannot have characters in my dataset, so first of all I have performed an ordinal encoding to pass from character to integer type. For example, for the feature gender I have assigned 0 to males, 1 to females and 2 to others. I have done the same for the features ever_married, Residence_type, work_type and smoking_status.
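As a minimal sketch of this encoding (with toy data and an explicit mapping, not the actual dataset; the column names follow the Kaggle dataset):

```r
# Toy illustration of the ordinal encoding (not the actual dataset).
df <- data.frame(gender       = c("Male", "Female", "Other", "Female"),
                 ever_married = c("Yes", "No", "Yes", "Yes"),
                 stringsAsFactors = FALSE)

# Explicit mapping for gender: Male -> 0, Female -> 1, Other -> 2
gender_map <- c("Male" = 0, "Female" = 1, "Other" = 2)
df$gender <- unname(gender_map[df$gender])

# The same idea via factor levels for a binary column: No -> 0, Yes -> 1
df$ever_married <- as.integer(factor(df$ever_married, levels = c("No", "Yes"))) - 1L
```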
The correlations are measured considering the Pearson formula:
\(\rho_{XY} = \frac{Cov(X,Y)}{\sigma_X \cdot \sigma_Y}\) Where:
From the correlation matrix it is notable that:
gender and Residence_type are uncorrelated with the rest of the features
age is correlated mainly with ever_married and work_type. This is reasonable, because older people tend to be married, or to have been married, and they probably work (they do not belong to the children or never-worked classes); the sign of these correlations is simply due to how I assigned the numbers to the classes. Age is also linked to smoking status and BMI, but also to hypertension and heart disease. So age will be an important feature, because it reveals a lot about a patient and the majority of the other features are related to it in some way.
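The Pearson formula above can be checked directly against R's built-in cor() on simulated data:

```r
# Pearson correlation computed from its definition, compared with cor().
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# rho = Cov(X,Y) / (sigma_X * sigma_Y)
pearson <- function(a, b) cov(a, b) / (sd(a) * sd(b))

all.equal(pearson(x, y), cor(x, y))  # TRUE
```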
In this case I do not have an excessive number of features and I could have used all of them, but I have tried to choose only those predictors which are necessary.
Feature selection is the process of reducing the number of input variables when developing a predictive model.
To do it I have looked at an online guide. [9] In Backward Elimination the model is initially trained with all the predictors, and then the least contributive predictor is iteratively removed. The training stops when all the remaining predictors are statistically significant. I have read that some methods remove the features using the p-value, while others evaluate the model through metrics like the RMSE.
Based on this training, I have plotted the variable importance using the function varImp, which tracks the changes in a model statistic (RSS, the residual sum of squares) and accumulates the reduction in the statistic when each predictor is added to the model. This total reduction is used as the variable importance measure; it is a positive number. [10]
As expected, the patient's age is the most important feature, followed by the average glucose level, hypertension and heart disease. Gender and residence type, the two features uncorrelated with the others, have no importance.
As predictors I have chosen just the two most important features, because I would have obtained basically the same results had I chosen the first four.
The main goal is to do a Bayesian analysis aimed at understanding whether a patient had a stroke or not. First of all I have split the dataset randomly into training (70%) and test (30%). The response variable, stroke, can be modeled as a Bernoulli distribution with success probability (probability of having had a stroke) equal to \(p_i\).
So as a first model, I have decided to consider the following logistic regression model with the logit as link function (linking probabilities with outputs; its inverse is the sigmoid):
\(logit(p_i) = log\Big(\frac{p_i}{1-p_i}\Big)= \beta_{0} + \beta_{2} \cdot x_{2_i} + \beta_{8} \cdot x_{8_i}\)
So it is a linear model of the logarithm of the odds (\(\frac{p}{1-p}\)); indeed the logit is also called the log-odds, since it is equal to the logarithm of the odds.
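The link and its inverse can be written out in a couple of lines:

```r
# The logit link and its inverse, the sigmoid.
logit   <- function(p) log(p / (1 - p))   # log-odds
sigmoid <- function(z) 1 / (1 + exp(-z))

p <- 0.2
z <- logit(p)
all.equal(sigmoid(z), p)  # TRUE: the sigmoid maps the log-odds back to p
```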
The prior beta parameters follow a normal distribution with \(\mu = 0\) and precision \(\tau = 0.000001\). Notice that in JAGS the second parameter of dnorm is the precision \(\tau = \frac{1}{\sigma^{2}}\), so this corresponds to a standard deviation of 1000.
I have defined the \(\beta_i\) as Normal because they can take any real value, and with such a small precision the prior distributions are very flat, not concentrated around the mean.
I have implemented the model using RJags and it is defined as follows:
#seed
set.seed(1234)
#sampling indices from the dataset
idxtr <- sample(1:nrow(dat), 0.7*nrow(dat))
dat_train <- dat_1[idxtr,]
dat_test <- dat_1[-idxtr,]
# number of train
N <- nrow(dat_train)
model <- function(){
  # Likelihood
  for (i in 1:N){
    y[i] ~ dbern(p[i])
    logit(p[i]) <- beta0 + beta2*x2[i] + beta8*x8[i]
  }
  # Defining the prior beta parameters
  beta0 ~ dnorm(0, 1.0E-6)
  beta2 ~ dnorm(0, 1.0E-6)
  beta8 ~ dnorm(0, 1.0E-6)
}
# Passing the data for RJags
y <- as.vector(dat_train$stroke)
data.jags <- list("y" = y, "N" = N,
"x2" = x2,
"x8" = x8)
# Defining parameters of interest to show after running RJags
mod.params <- c("beta0", "beta2", "beta8")
# Run JAGS
n.chains <- 3
t_start = Sys.time()
mod.fit <- jags(data = data.jags, # DATA
model.file = model, # MODEL
parameters.to.save = mod.params, # TRACKING
n.chains = n.chains, n.iter = 10000, n.burnin = 1000, n.thin = 10) # MCMC
## module glm loaded
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 3577
## Unobserved stochastic nodes: 3
## Total graph size: 20945
##
## Initializing model
tempo_fit1 = round(Sys.time() - t_start,2)
mod.fit$BUGSoutput$summary
## mean sd 2.5% 25% 50%
## beta0 -3.9163250 0.16732738 -4.231497 -4.0126909 -3.9165997
## beta2 1.6182035 0.13353232 1.383811 1.5376977 1.6160712
## beta8 0.2400783 0.05922978 0.124735 0.2004267 0.2404545
## deviance 1198.2001167 19.97158783 1194.802380 1195.8114164 1196.9473119
## 75% 97.5% Rhat n.eff
## beta0 -3.8154030 -3.6412273 1.003179 730
## beta2 1.6992859 1.8734902 1.002570 940
## beta8 0.2808708 0.3514042 1.001462 2000
## deviance 1198.6177947 1203.6226599 1.000477 2700
## The DIC of the first model is: 1397.77522523058
I have not specified the inits, so the starting values of the three chains are random, drawn according to the prior distribution. My prior distribution is very flat, so the starting points may be far from the region where the posterior distribution concentrates; but this is not a problem in this case.
\(\beta_0\) is the intercept while \(\beta_i\) measures the marginal impact of the predictor \(X_i\) on the log-odds in favor of \(Y = 1\).
The summary shows the mean, a point estimate for \(\beta_i\), and the sd, which tells how variable the posterior distribution is around this point estimate. The larger the sd, the larger the credible interval.
I observe that all the estimated means of the \(\beta_i\) are non-zero and that zero is not contained in the credible intervals, so there is evidence that the chosen predictors are significant for the model. Indeed, if \(\beta_i\) were zero it would mean that the corresponding predictor \(x_i\) has no effect in the model.
n.eff is the effective sample size: the number of independent Monte Carlo samples that would give the same precision as the MCMC samples. The greater this value, the lower the autocorrelation of that component; the idea is to have a sort of “exchange rate” between dependent and independent samples. To reduce the dependence structure I have selected n.thin = 10, so I keep one iteration every ten. In this way I throw away many iterations and spend more time to obtain a high number of them. Moreover, I have discarded the first 1000 iterations as burn-in.
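The idea behind n.eff can be sketched by estimating \(T / (1 + 2\sum_k \rho_k)\) on a deliberately autocorrelated chain (a toy AR(1) series, not the actual JAGS output):

```r
# Effective sample size sketch: T / (1 + 2 * sum of autocorrelations).
set.seed(1)
chain <- as.numeric(arima.sim(list(ar = 0.8), n = 5000))  # strongly autocorrelated

rho <- acf(chain, lag.max = 100, plot = FALSE)$acf[-1]    # drop lag 0
rho <- rho[seq_len(max(which(rho > 0.05), 1))]            # keep lags before the noisy tail
n_eff <- length(chain) / (1 + 2 * sum(rho))
n_eff  # far fewer "independent" draws than the 5000 iterations
```

In practice coda::effectiveSize() computes a more careful spectral estimate of the same quantity.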
The deviance is a transformation of the likelihood function evaluated at the parameters simulated in the Markov chain. It is an indication of whether I have hit the most interesting region of the posterior distribution: I want the deviance to decrease until it becomes stationary, which happens when the correct region is reached.
DIC is the Deviance Information Criterion, the most popular criterion to compare alternative models. DIC can be used only as a comparative index: there is no way of reading DIC on an absolute scale, so it is useful only when different models are compared on the same dataset. Lower values are better.
Rhat is the potential scale reduction factor, provided by the Gelman and Rubin diagnostic. It applies when there is more than one Markov chain: it compares the variability among the means of the individual chains, with respect to the overall mean, against the within-chain variability. Values near 1 suggest convergence.
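A simplified version of the Gelman-Rubin statistic can be computed by hand on toy chains (the real diagnostic, e.g. coda::gelman.diag, also accounts for sampling variability):

```r
# Simplified potential scale reduction factor for m chains of length n.
set.seed(1)
chains <- replicate(3, rnorm(1000))       # 3 toy chains already at stationarity

n <- nrow(chains)
W <- mean(apply(chains, 2, var))          # within-chain variance
B <- n * var(colMeans(chains))            # between-chain variance
var_hat <- (n - 1) / n * W + B / n        # pooled variance estimate
Rhat <- sqrt(var_hat / W)
Rhat  # close to 1 when the chains have mixed
```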
I show the univariate trace plots of the simulations of each parameter. The idea of a trace plot is to look at a single component simulated over time: if the Markov chain behaves properly, the trace goes up and down around a stable level, because the stationarity region has been reached and the chain is exploring it in the appropriate way.
Then I also look at the posterior distribution of each component of the parameter set, checking both the overlap of the different chains and that the posterior mass is away from zero, which is evidence that the parameter matters.
Now I have plotted the autocorrelation function (ACF) plots, which show how each time point is related to the previous ones. On the y-axis there is an estimate of \(Cor(\theta_t, \theta_{t+h})\), in other words how strongly one simulation depends on the previous ones. If the process is stationary, we expect this dependence to become smaller and smaller, eventually vanishing, so the function stabilizes around zero. The more slowly it goes to zero, the stronger the dependence structure in the chain.
The ACFs show good behaviour, since the samples should be independent of each other across iterations: the correlations essentially vanish after the first lag, which is a good sign!
These are the plots of the running means: I take the empirical average \(\mathbf{\hat{I}}_{t}\), the approximation of the theoretical expectation at different times (using different numbers of simulations), and I observe the converging behavior as the number of simulations increases. This also shows which parameters suffer more from autocorrelation; they will have an approximation error greater than in the iid case.
In all three chains each parameter reaches essentially the same end point, meaning that with different initial points I obtain the same estimated mean for each parameter.
Now I want to analyze the approximation error of the MCMC sampling. I consider the MCSE, the square root of the Monte Carlo variance of the estimator.
The variance formula in the MCMC sampling is:
\(\mathbf{V}[\hat{I}_{t}] = \frac{Var_{\pi}[h(X_{1})]}{t_{eff}} = \Big( 1 + 2 \sum_{k=1}^{\infty} \rho_{k}\Big)\frac{\sigma^{2}}{T}\)
The variance of an MCMC estimator is usually larger than in iid simulation, so MCMC is somewhat inefficient; the more so when the correlation is strong and positive.
Let's compute the MCSE, the square root of the variance formula written above:
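Plugging in the summary values for \(\beta_2\) (sd \(\approx\) 0.1335, n.eff = 940), the MCSE can be approximated as the posterior sd over the square root of the effective sample size:

```r
# MCSE approximation: posterior sd / sqrt(effective sample size).
posterior_sd <- 0.1335   # sd of beta2 from the summary above
n_eff <- 940             # n.eff of beta2 from the summary above
mcse <- posterior_sd / sqrt(n_eff)
mcse  # about 0.0044, much smaller than the posterior sd
```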
As we can see, \(\beta_2\) has the highest approximation error considering the chains jointly.
The uncertainty is measured as the ratio between the standard deviation of a parameter and the absolute value of its expectation:
The highest posterior uncertainty is on \(\beta_8\).
Now I have drawn the correlations between all the values sampled during the MCMC run:
There is a high correlation between the intercept \(\beta_0\) and \(\beta_2\).
## beta0 beta2 beta8 deviance
## -3.9163250 1.6182035 0.2400783 1198.2001167
I have decided to pool all the chains together and compute the credible intervals and the point estimates for the estimated beta values.
## beta0 beta2 beta8 deviance
## 2.5% -4.231497 1.383811 0.1247351 1194.802
## 97.5% -3.641227 1.873490 0.3514042 1203.623
The following HPD intervals are at most as wide as the equal-tail credible intervals above.
## lower upper
## beta0 -4.215264 -3.6280090
## beta2 1.382442 1.8721261
## beta8 0.125127 0.3519389
## deviance 1194.619677 1202.3076540
## attr(,"Probability")
## [1] 0.95
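The relation between the two kinds of interval can be illustrated on a skewed sample: the HPD interval is the shortest window containing 95% of the draws (the report uses coda's HPDinterval; this hand-rolled version is only a sketch):

```r
# Equal-tail vs HPD 95% intervals on a skewed "posterior" sample.
set.seed(1)
draws <- rgamma(10000, shape = 2, rate = 1)

et <- quantile(draws, c(0.025, 0.975))    # equal-tail interval

hpd95 <- function(x, prob = 0.95) {
  x <- sort(x)
  k <- floor(prob * length(x))
  widths <- x[(k + 1):length(x)] - x[1:(length(x) - k)]
  i <- which.min(widths)                  # shortest window with ~95% of the draws
  c(lower = x[i], upper = x[i + k])
}
hpd <- hpd95(draws)

as.numeric(diff(hpd)) < as.numeric(diff(et))  # the HPD interval is shorter here
```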
It is worth reaffirming that all the parameters are significant for predicting whether a patient had a stroke or not.
It would be interesting to check whether these multiple simulations of the Markov chains achieve convergence (formally, convergence is never reached) and to verify the validity of the stationarity regions. To do so I have used the following tests:
“Geweke (1992) proposed a convergence diagnostic for Markov chains based on a test for equality of the means of the first and last part of a Markov chain (by default the first 10% and the last 50%). If the samples are drawn from the stationary distribution of the chain, the two means are equal and Geweke’s statistic has an asymptotically standard normal distribution.” [11]
“The test statistic is a standard Z-score: the difference between the two sample means divided by its estimated standard error.”
## Z-score chain 1 Z-score chain 2 Z-score chain 3
## beta0 0.9812117 -0.8092299 -0.3036849
## beta2 -0.8764156 0.6357221 0.4122321
## beta8 0.4329297 -0.1148482 -0.1686733
## deviance 1.1257892 1.0990148 1.0146728
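The Z-scores in the table can be sketched by hand as the difference of the two segment means over its standard error (on a toy chain; the real Geweke diagnostic estimates the standard errors from the spectral density, which also accounts for autocorrelation):

```r
# Geweke-style Z-score on a toy chain: first 10% vs last 50%.
set.seed(1)
chain <- rnorm(900)                       # a chain already at stationarity

first <- chain[1:90]                      # first 10%
last  <- chain[451:900]                   # last 50%

z <- (mean(first) - mean(last)) /
     sqrt(var(first) / length(first) + var(last) / length(last))
z  # should be small in absolute value for a stationary chain
```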
“The convergence test uses the Cramer-von-Mises statistic to test the null hypothesis that the sampled values come from a stationary distribution. The test is successively applied, firstly to the whole chain, then after discarding the first 10%, 20%, … of the chain until either the null hypothesis is accepted, or 50% of the chain has been discarded. The latter outcome constitutes ‘failure’ of the stationarity test and indicates that a longer MCMC run is needed. If the stationarity test is passed, the number of iterations to keep and the number to discard are reported.
The half-width test calculates a 95% confidence interval for the mean, using the portion of the chain which passed the stationarity test. Half the width of this interval is compared with the estimate of the mean. If the ratio between the half-width and the mean is lower than eps, the halfwidth test is passed. Otherwise the length of the sample is deemed not long enough to estimate the mean with sufficient accuracy.”
##
## Stationarity start p-value
## test iteration
## beta0 passed 1 0.1398
## beta2 passed 1 0.2006
## beta8 passed 1 0.4386
## deviance passed 1 0.0522
##
## Halfwidth Mean Halfwidth
## test
## beta0 passed -3.925 0.0183
## beta2 passed 1.623 0.0143
## beta8 passed 0.243 0.0039
## deviance passed 1198.075 1.3675
The majority of the values are inside the acceptance region, so the equality of the means is almost always accepted.
Now it is interesting to observe the performance of the model on the test set, i.e., on observations the model has not yet used.
## The overall accuracy is: 0.924331376386171
## The balanced accuracy is: 0.513746255883611
In a highly imbalanced dataset, say a binary dataset with a class ratio of 95:5, a model that always predicts the majority class and completely ignores the minority class will still be 95% correct. This renders measures like classification accuracy meaningless. For this reason I have also printed the balanced accuracy, which is more relevant in this case because it gives the same weight to both classes. It is defined as the arithmetic mean of sensitivity (TP / (TP + FN)) and specificity (TN / (TN + FP)). [8]
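A toy confusion matrix makes the difference between the two metrics concrete (made-up counts, not the actual test-set results):

```r
# Accuracy vs balanced accuracy on an imbalanced toy confusion matrix.
TP <- 10; FN <- 40; TN <- 1400; FP <- 80

sensitivity  <- TP / (TP + FN)                      # 0.20
specificity  <- TN / (TN + FP)                      # ~0.946
accuracy     <- (TP + TN) / (TP + TN + FP + FN)     # ~0.92, looks good
balanced_acc <- (sensitivity + specificity) / 2     # ~0.57, reveals the problem

c(accuracy = accuracy, balanced = balanced_acc)
```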
Let's introduce a new model, to see whether it is better to change the current model or not.
To compare with the first model, I have implemented another linear classifier which uses the same predictors but a different link function, the complementary log-log: \(cloglog(p_i) = log(-log(1 - p_i)) = \beta_{0} + \beta_{2} \cdot x_{2_i} + \beta_{8} \cdot x_{8_i}\)
I wanted to try this function because I have read that it often produces different results from the logit and probit link functions.
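The cloglog link and its inverse can be written out like the logit; unlike the sigmoid, the inverse cloglog is asymmetric around p = 0.5, which is one reason it can behave differently on rare events:

```r
# Complementary log-log link and its inverse.
cloglog     <- function(p) log(-log(1 - p))
inv_cloglog <- function(z) 1 - exp(-exp(z))

p <- 0.05
all.equal(inv_cloglog(cloglog(p)), p)  # TRUE: the inverse recovers p

# Asymmetry: at z = 0 the inverse cloglog is 1 - exp(-1), about 0.63, not 0.5
inv_cloglog(0)
```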
model2 <- function(){
  # Likelihood
  for (i in 1:N){
    y[i] ~ dbern(p[i])
    cloglog(p[i]) <- beta0 + beta2*x2[i] + beta8*x8[i]
  }
  # Defining the prior beta parameters
  beta0 ~ dnorm(0, 1.0E-6)
  beta2 ~ dnorm(0, 1.0E-6)
  beta8 ~ dnorm(0, 1.0E-6)
}
# Passing the data for RJags
y <- as.vector(dat_train$stroke)
data.jags <- list("y" = y, "N" = N,
"x2" = x2, "x8" = x8)
# Defining parameters of interest to show after running RJags
mod.params <- c("beta0", "beta2", "beta8")
# Run JAGS
n.chains <- 3
t_start = Sys.time()
mod.fit2 <- jags(data = data.jags, # DATA
model.file = model2, # MODEL
parameters.to.save = mod.params, # TRACKING
n.chains = n.chains, n.iter = 10000, n.burnin = 1000, n.thin = 10) # MCMC
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 3577
## Unobserved stochastic nodes: 3
## Total graph size: 20945
##
## Initializing model
tempo_fit2 = round(Sys.time() - t_start,2)
mod.fit2$BUGSoutput$summary
## mean sd 2.5% 25% 50%
## beta0 -3.9133334 0.14970955 -4.220175 -4.0099045 -3.9109619
## beta2 1.5402178 0.12408457 1.303552 1.4552406 1.5359490
## beta8 0.2148722 0.05332664 0.110588 0.1800098 0.2137535
## deviance 1198.9349945 2.48715155 1196.139640 1197.1220581 1198.3105027
## 75% 97.5% Rhat n.eff
## beta0 -3.8098916 -3.6322374 1.001364 2200
## beta2 1.6224468 1.7915566 1.001818 1500
## beta8 0.2512178 0.3219771 1.002506 970
## deviance 1200.0222349 1205.3354464 1.000676 2700
## The DIC of the model is: 1202.02952972316
## beta0 beta2 beta8 deviance
## Lag 0 1.000000000 1.00000000 1.0000000000 1.000000000
## Lag 10 0.143597033 0.14704870 0.0149743703 0.024427512
## Lag 50 0.043870446 0.01768970 -0.0025473162 0.025805568
## Lag 100 -0.045003985 -0.02165894 0.0004860988 0.008186704
## Lag 500 -0.001009569 0.01477181 -0.0179452712 -0.004575765
## The overall accuracy is: 0.921722113502935
## The balanced accuracy is: 0.562990300955641
Here my goal is to balance the dataset, and since it is quite small I have preferred to apply SMOTE (Synthetic Minority Oversampling Technique) rather than an undersampling technique.
With SMOTE the new instances are not just copies of existing minority cases. The algorithm loops over each observation in the minority class and, at each iteration, identifies its k nearest neighbors. It then takes the difference between the feature vector of the minority member and each selected neighbor, and multiplies each difference by a random number in (0, 1). This constructs synthetic observations located at a random distance from the minority member, in the direction of each neighbor: new synthetic minority instances are generated somewhere between the minority instance and that neighbor. [7]
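One interpolation step of SMOTE can be sketched on toy minority points (the actual analysis uses a package implementation of SMOTE, not this hand-rolled version):

```r
# One SMOTE step: a synthetic point between a minority observation and a neighbour.
set.seed(1)
minority <- matrix(rnorm(10 * 2), ncol = 2)     # 10 minority points, 2 features
x <- minority[1, ]

d  <- sqrt(colSums((t(minority) - x)^2))        # Euclidean distances to x
nn <- order(d)[2:4]                             # k = 3 nearest neighbours (skip x itself)
neighbour <- minority[sample(nn, 1), ]

u <- runif(1)                                   # random number in (0, 1)
x_new <- x + u * (neighbour - x)                # synthetic minority instance
x_new  # lies on the segment between x and the chosen neighbour
```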
So I have reused my first model and repeated the analysis with the new augmented dataset. I show how the proportion between healthy people and patients who had a stroke has changed.
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 6714
## Unobserved stochastic nodes: 3
## Total graph size: 42855
##
## Initializing model
## mean sd 2.5% 25% 50%
## beta0 -0.2877761 0.03483902 -0.3556595 -0.3112692 -0.2874134
## beta2 1.8770869 0.05705053 1.7820161 1.8436084 1.8753143
## beta8 0.3584680 0.03274133 0.2938553 0.3364891 0.3573909
## deviance 6202.4122320 12.43184602 6199.2749508 6200.2589943 6201.4606128
## 75% 97.5% Rhat n.eff
## beta0 -0.2646135 -0.2226060 1.000776 2700
## beta2 1.9111031 1.9812644 1.001212 2600
## beta8 0.3811315 0.4212129 1.001902 1400
## deviance 6203.0934790 6208.4687095 1.007016 2700
## The DIC of the model is: 6279.74108420959
I cannot compare this DIC with those obtained before, because now I am not using the same data, so I cannot compare likelihoods!
## beta0 beta2 beta8 deviance
## Lag 0 1.000000000 1.000000000 1.000000000 1.0000000000
## Lag 10 -0.004323259 0.034121949 -0.012693617 0.0012711267
## Lag 50 -0.018067068 -0.006995527 0.034627035 -0.0005451104
## Lag 100 -0.022247134 -0.013609026 0.018607380 -0.0009990836
## Lag 500 -0.014029930 -0.001145892 -0.009079555 -0.0037013300
## The overall accuracy is: 0.696316886726894
## The balanced accuracy is: 0.696221520088121
Here I print a summary of all the measures taken along the way:
Comparing the DICs, the second model is better than the first one. Furthermore, it shows a better balanced accuracy, which is very important because the minority class is the most relevant one. On the other hand, it takes more than three times longer to run. Here this is acceptable because the original dataset is quite small, but in other situations it could be a problem. For example, on the augmented dataset I have reused the first model rather than the second one, because increasing the number of observations doubles the running time. However, the balanced accuracy increased a lot, up to 0.69, so applying this technique to the dataset was worthwhile.
I have identified several improvements that could be made in the future:
Create different models using different binary classifiers, such as SVM, K-Nearest Neighbours, Decision Tree, etc.
More interesting plots to describe the data
Consider, after oversampling the dataset, a validation set to better tune the parameters
Use other link functions
Add other metrics to the prediction part, such as F1 Score, Recall, Precision, etc.
https://www.world-stroke.org/world-stroke-day-campaign/why-stroke-matters/learn-about-stroke#:~:text=Globally%201%20in%204%20adults,the%20world%20have%20experienced%20stroke.
https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
https://my.clevelandclinic.org/health/diagnostics/12363-blood-glucose-test#:~:text=A%20blood%20glucose%20test%20is,indicate%20pre%2Ddiabetes%20or%20diabetes.
https://www.dominodatalab.com/blog/smote-oversampling-technique
https://neptune.ai/blog/balanced-accuracy#:~:text=Balanced%20Accuracy%20is%20used%20in,lot%20more%20than%20the%20other.
Lecture slides
https://www.kaggle.com/code/nulldata/beginners-guide-to-highchart-visual-in-r/report